OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

User-configurable OCR enhancement for online natural history archives

Internal identifier: 000F19 (Main/Exploration); previous: 000F18; next: 000F20

Authors: Andy Downton [United Kingdom]; Jingyu He [United Kingdom]; Simon Lucas [United Kingdom]

Source: International journal on document analysis and recognition (Print), ISSN 1433-2833, 2007

RBID : Pascal:07-0469293

Abstract

The creation of structured digital libraries from paper-based archives is an area of growing demand in many scientific and cultural fields, and is not satisfied either by off-the-shelf OCR or commercial form-processing systems. This paper describes and evaluates a configurable archive construction system, which integrates document image pre-processing and analysis with text post-processing tools and a standard OCR package to meet digital archiving requirements. The prototype system is currently being used in conjunction with the UK Natural History Museum to help convert more than 500,000 cards of Lepidoptera (Butterflies and Moths) and Coleoptera (Beetles) to searchable digital archives. Evaluation results covering different aspects of the system from card scanning to overall word recognition rates for different database fields are summarised for two datasets comprising over 5,000 cards selected from different parts of these archives. First-pass end-to-end word recognition rates of 70-90% are reported for key data fields, subject to availability of suitable electronic dictionaries. Further validation and correction is supported through web-editing of the online digital archive.


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">User-configurable OCR enhancement for online natural history archives</title>
<author>
<name sortKey="Downton, Andy" sort="Downton, Andy" uniqKey="Downton A" first="Andy" last="Downton">Andy Downton</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Department of Electronic Systems Engineering, University of Essex, Wivenhoe Park</s1>
<s2>Colchester, CO4 3SQ</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<wicri:noRegion>Colchester, CO4 3SQ</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Jingyu He" sort="Jingyu He" uniqKey="Jingyu He" last="Jingyu He">JINGYU HE</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Department of Electronic Systems Engineering, University of Essex, Wivenhoe Park</s1>
<s2>Colchester, CO4 3SQ</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<wicri:noRegion>Colchester, CO4 3SQ</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Lucas, Simon" sort="Lucas, Simon" uniqKey="Lucas S" first="Simon" last="Lucas">Simon Lucas</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Department of Electronic Systems Engineering, University of Essex, Wivenhoe Park</s1>
<s2>Colchester, CO4 3SQ</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<wicri:noRegion>Colchester, CO4 3SQ</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">07-0469293</idno>
<date when="2007">2007</date>
<idno type="stanalyst">PASCAL 07-0469293 INIST</idno>
<idno type="RBID">Pascal:07-0469293</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000320</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000466</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000261</idno>
<idno type="wicri:doubleKey">1433-2833:2007:Downton A:user:configurable:ocr</idno>
<idno type="wicri:Area/Main/Merge">000F32</idno>
<idno type="wicri:Area/Main/Curation">000F19</idno>
<idno type="wicri:Area/Main/Exploration">000F19</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">User-configurable OCR enhancement for online natural history archives</title>
<author>
<name sortKey="Downton, Andy" sort="Downton, Andy" uniqKey="Downton A" first="Andy" last="Downton">Andy Downton</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Department of Electronic Systems Engineering, University of Essex, Wivenhoe Park</s1>
<s2>Colchester, CO4 3SQ</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<wicri:noRegion>Colchester, CO4 3SQ</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Jingyu He" sort="Jingyu He" uniqKey="Jingyu He" last="Jingyu He">JINGYU HE</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Department of Electronic Systems Engineering, University of Essex, Wivenhoe Park</s1>
<s2>Colchester, CO4 3SQ</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<wicri:noRegion>Colchester, CO4 3SQ</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Lucas, Simon" sort="Lucas, Simon" uniqKey="Lucas S" first="Simon" last="Lucas">Simon Lucas</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Department of Electronic Systems Engineering, University of Essex, Wivenhoe Park</s1>
<s2>Colchester, CO4 3SQ</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<wicri:noRegion>Colchester, CO4 3SQ</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint>
<date when="2007">2007</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Availability</term>
<term>Character recognition</term>
<term>Construction system</term>
<term>Data field</term>
<term>Database</term>
<term>Digital archive</term>
<term>Document analysis</term>
<term>Document image processing</term>
<term>Electronic dictionary</term>
<term>Electronic library</term>
<term>Internet</term>
<term>Museum</term>
<term>Natural language</term>
<term>Optical character recognition</term>
<term>Text analysis</term>
<term>Validation</term>
<term>Word</term>
<term>World wide web</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Bibliothèque électronique</term>
<term>Traitement image document</term>
<term>Mot</term>
<term>Langage naturel</term>
<term>Base donnée</term>
<term>Zone donnée</term>
<term>Disponibilité</term>
<term>Internet</term>
<term>Analyse documentaire</term>
<term>Archive électronique</term>
<term>Système construction</term>
<term>Analyse texte</term>
<term>Musée</term>
<term>Réseau web</term>
<term>Validation</term>
<term>Dictionnaire électronique</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Base de données</term>
<term>Musée</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">The creation of structured digital libraries from paper-based archives is an area of growing demand in many scientific and cultural fields, and is not satisfied either by off-the-shelf OCR or commercial form-processing systems. This paper describes and evaluates a configurable archive construction system, which integrates document image pre-processing and analysis with text post-processing tools and a standard OCR package to meet digital archiving requirements. The prototype system is currently being used in conjunction with the UK Natural History Museum to help convert more than 500,000 cards of Lepidoptera (Butterflies and Moths) and Coleoptera (Beetles) to searchable digital archives. Evaluation results covering different aspects of the system from card scanning to overall word recognition rates for different database fields are summarised for two datasets comprising over 5,000 cards selected from different parts of these archives. First-pass end-to-end word recognition rates of 70-90% are reported for key data fields, subject to availability of suitable electronic dictionaries. Further validation and correction is supported through web-editing of the online digital archive.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Royaume-Uni</li>
</country>
</list>
<tree>
<country name="Royaume-Uni">
<noRegion>
<name sortKey="Downton, Andy" sort="Downton, Andy" uniqKey="Downton A" first="Andy" last="Downton">Andy Downton</name>
</noRegion>
<name sortKey="Jingyu He" sort="Jingyu He" uniqKey="Jingyu He" last="Jingyu He">JINGYU HE</name>
<name sortKey="Lucas, Simon" sort="Lucas, Simon" uniqKey="Lucas S" first="Simon" last="Lucas">Simon Lucas</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F19 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000F19 | SxmlIndent | more
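
As a rough illustration of post-processing the XML that these commands print, the Python sketch below pulls the title, year and author names out of a record. It is not part of Dilib: the function name parse_record, the wrapper element and the urn:example:* namespace URIs are all assumptions introduced here. The wrapper is needed because the record uses the wicri: and inist: prefixes without declaring them, which a strict XML parser rejects.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace bindings (invented for this sketch): the record uses
# the wicri: and inist: prefixes without declaring them, so we wrap it in an
# element that binds them to arbitrary URIs before parsing.
WRAPPER_OPEN = (
    '<wrapper xmlns:wicri="urn:example:wicri" '
    'xmlns:inist="urn:example:inist">'
)
WRAPPER_CLOSE = "</wrapper>"


def parse_record(record_xml: str) -> dict:
    """Return the title, year and author names of one <record> element.

    Assumes record_xml has no XML declaration and starts at <record>.
    """
    root = ET.fromstring(WRAPPER_OPEN + record_xml + WRAPPER_CLOSE)
    file_desc = root.find("./record/TEI/teiHeader/fileDesc")
    title_stmt = file_desc.find("titleStmt")
    date = file_desc.find("./publicationStmt/date")
    return {
        "title": title_stmt.findtext("title"),
        "year": date.get("when"),
        "authors": [n.text for n in title_stmt.findall("./author/name")],
    }


# Abridged copy of the record, shortened for the example.
SAMPLE = """<record><TEI><teiHeader><fileDesc>
<titleStmt>
<title xml:lang="en" level="a">User-configurable OCR enhancement for online natural history archives</title>
<author><name sortKey="Downton, Andy">Andy Downton</name>
<affiliation wicri:level="1"><country>Royaume-Uni</country></affiliation></author>
<author><name sortKey="Lucas, Simon">Simon Lucas</name></author>
</titleStmt>
<publicationStmt><date when="2007">2007</date></publicationStmt>
</fileDesc></teiHeader></TEI></record>"""

info = parse_record(SAMPLE)
print(info)
```

In practice the output of HfdSelect could be captured to a file and passed to parse_record in place of SAMPLE, provided it contains a single record.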

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:07-0469293
   |texte=   User-configurable OCR enhancement for online natural history archives
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024